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We study regularized estimation in high-dimensional longitudinal 
classification problems, using the lasso and fused lasso regularizers. The 
constructed coefficient estimates are piecewise constant across the time 
dimension in the longitudinal problem, with adaptively selected change 
points (break points). We present an efficient algorithm for computing 
such estimates, based on proximal gradient descent. We apply our pro¬ 
posed technique to a longitudinal data set on Alzheimer’s disease from 
the Cardiovascular Health Study Cognition Study, and use this data set 
to motivate and demonstrate several practical considerations such as 
the selection of tuning parameters, and the assessment of model stabil¬ 
ity. 


1. Introduction. In this paper, we study longitudinal classification problems in which the num¬ 
ber of predictors can exceed the number of observations. The setup: we observe n individuals across 
discrete timepoints t = 1,... T. At each timepoint we record p predictor variables per individual, and 
an outcome that places each individual into one of K classes. The goal is to construct a model that 
predicts the outcome of an individual at time t+ A, given his or her predictor measurements at time t. 
Since we allow for the possibility that p > n, regularization must be employed in order for such a pre¬ 
dictive model (e.g., based on maximum likelihood) to be well-defined. Borrowing from the extensive 
literature on high-dimensional regression, we consider two well-known regularizers, each of which 
also has a natural place in high-dimensional longitudinal analysis for many scientific problems of 
interest. The first is the lasso regularize^ which encourages overall sparsity in the active (contribut¬ 
ing) predictors at each timepoint; the second is the fused lasso regularizer, which encourages a notion 
of persistence or contiguity in the sets of active predictors across timepoints. 

Our work is particularly motivated by the analysis of a large data set provided by the Cardio¬ 
vascular Health Study Cognition Study (CHS-CS). Over the past 24 years, the CHS-CS recorded 
multiple metabolic, cardiovascular and neuroimaging risk factors for Alzheimer’s disease (AD), as 
well as detailed cognitive assessments for people of ages 65 to 110 years old [Lopez et al., 2003, Sax¬ 
ton et al., 2004, Lopez et al., 2007]. As a matter of background, the prevalence of AD increases at 
an exponential-like rate beyond the age of 65. After 90 years of age, the incidence of AD increases 
dramatically, from 12.7% per year in the 90-94 age group, to 21.2% per year in the 95-99 age group, 
and to 40.7% per year for those older than 100 years [Evans et al., 1989, Fitzpatrick et al., 2004, 
Corrada et al., 2010]. Later, we examine data from 924 individuals in the Pittsburgh section of the 
CHS-CS. The objective is to use the data available from subjects at t years of age to predict the onset 
of AD at t + 10 years of age (A = 10). For each age, the outcome variable assigns an individual to one 
of 3 categories: normal, dementia, death. Refer to Section 3 for our analysis of the CHS-CS data set. 

* These authors contributed equally to this work 
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1.1. The multinomial fused lasso model. Given the number of parameters involved in our general 
longitudinal setup, it will be helpful to be clear about notation: see Table 1. Note that the matrix Y 
stores future outcome values, i.e., the element Y lt records the outcome of the ith individual at time 
t + A, where A > 0 determines the time lag of the prediction. In the following, we will generally use 
the symbol to denote partial indexing; examples are X l . t , the vector of p predictors for individual 

1 at time t, and f. t k, the vector of p multinomial coefficients at time t and for class k. Also, Section 

2 will introduce an extension of the basic setup in which the number of individuals can vary across 
timepoints, with rtt denoting the number of individuals at each timepoint t - 1,... T. 


Parameter 

Meaning 

i = l,...n 

index for individuals 

j = l,--P 

index for predictors 

t=l,...T 

index for timepoints 

k = l ,...K 

index for outcomes 

Y 

nxT matrix of (future) outcomes 

X 

nx p xT array of predictors 

ho 

T x (K - 1) matrix of intercepts 

p 

p x T x (K - 1) array of coefficients 


Table 1 

Notation used throughout the paper. 


At each timepoint t = 1,... T, we use a separate multinomial logit model for the outcome at time 
t + A: 


( 1 ) 


, nY lt = i\Xi. t = x) rT 

log -— = Potl + P.aX 


log 


P(Y lt = K\X l . t =x) 
P(Y it = 2\Xi. t = x) 
P(Y lt = K\X l . t = x) 


- p0t2 + P. t 2 x 


, P(Y lt =K- l\Xi. t =x) n rT 

The coefficients are determined by maximizing a penalized log likelihood criterion, 


( 2 ) 


(fio, j3) e argmax £(fo,P) - XiPiip) - 

h.P 


where £(po,p) is the multinomial log likelihood, 

ap 0 ,P) = 'Ltp(Yit\x i . t ), 

t=li=l 


P i is the lasso penalty [Tibshirani, 1996], 

p T K -1 

pm=LLL \Pjtk\- 

j=u=ik=i 

and P 2 is a version of the fused lasso penalty [Tibshirani et al., 2005] applied across timepoints, 

p T-1K-1 

^)=EE E \Pjtk-Pj( t+ i)k\. 

j= 1 1 =1 k=i 
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(The element notation in (2) emphasizes the fact that the maximizing coefficients (Po, /)) need not 
be unique, since the log likelihood ((fio,(3) need not be strictly concave—e.g., this is the case when 
p > n.) 

In broad terms, the lasso and fused lasso penalties encourage sparsity and persistence, respec¬ 
tively, in the estimated coefficients p. A larger value of the tuning parameter Ai > 0 generally corre¬ 
sponds to fewer nonzero entries in ft; a larger value of the tuning parameter A 2 > 0 generally corre¬ 
sponds to fewer change points in the piecewise constant coefficient trajectories [5j.k , across t = 1,...T. 
We note that the form the log likelihood £(po,/3) specified above assumes independence between the 
outcomes across timepoints, which is a rather naive assumption given the longitudinal nature of our 
problem setup. However, this naivety is partly compensated by the role of the fused lasso penalty, 
which ties together the multinomial models across timepoints. 

It helps to see an example. We consider a simple longitudinal problem with n = 50 individuals, 
T - 15 timepoints, and K = 2 classes. At each timepoint we sampled p = 30 predictors independently 
from a standard normal distribution. The true (unobserved) coefficient matrix p is now 30 x 15; we 
set Pj. = 0 for j = 1...27, and set the 3 remaining coefficients trajectories to be piecewise constant 
across t = 1,... 15, as shown in the left panel of Figure 1. In other words, the assumption here is 
that only 3 of the 30 variables are relevant for predicting the outcome, and these variables have 
piecewise constant effects over time. We generated a matrix of binary outcomes Y according to the 
multinomial model (1), and computed the multinomial fused lasso estimates po,p in (2). The right 
panel of Figure 1 displays these estimates (all but the intercept Pq) across t = 1,... 15, for a favorable 
choice of tuning parameters Ai = 2.5, A 2 = 12.5; the middle plot shows the unregularized (maximum 
likelihood) estimates corresponding to Ai = A 2 = 0. 



Time 


Time 


Time 


FlG 1. A simple example with n = 50, T = 15, K — 2, and p = 30. The left panel displays the true coefficent trajectories 
across timepoints t~ 1,... 15 (only 3 of the 30 are nonzero); the middle panel shows the (unregularized) maximum likelihood 
estimates; the right panel shows the regularized estimates from (1), with = 2.5 and A 2 = 12.5. 


Each plot in Figure 1 has a y-axis that has been scaled to suit its own dynamic range. We can see 
that the multinomial fused lasso estimates, with an appropriate amount of regularization, pick up 
the underlying trend in the true coefficients, though the overall magnitude of coefficients is shrunken 
toward zero (an expected consequence of the £\ penalties). In comparison, the unregularized multi- 
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nomial estimates are wild and do not convey the proper structure. From the perpsective of prediction 
error, the multinomial fused lasso estimates offer a clear advantage, as well: over 30 repetitions from 
the same simulation setup, we used both the regularized coefficient estimates (with X\ - 2.5 and 
A 2 = 12.5) and the unregularized estimates to predict the outcomes on an i.i.d. test set. The average 
prediction error using the regularized estimates was 0.114 (with a standard error of 0.014), while 
the average prediction error from the unregularized estimates was 0.243 (with a standard error of 
0 . 022 ). 

1.2. Related work and alternative approaches. The fused lasso was first introduced in the statis¬ 
tics literature by Tibshirani et al. [2005], and similar ideas based on total variation, starting with 
Rudin et al. [1992], have been proposed and studied extensively in the signal processing community. 
There have been many interesting statistical applications of the fused lasso, in problems involving 
the analysis of comparative genomic hybridization data [Tibshirani and Wang, 2008], the modeling 
of genome association networks [Kim and Xing, 2009], and the prediction of colorectal cancer [Lin 
et al., 2013]. The fused lasso has in fact been applied to the study of Alzheimer’s disease in Xin et al. 
[2014], though these authors consider a very different prediction problem than ours, based on static 
magnetic resonance images, and do not have the time-varying setup that we do. 

Our primary motivation, which is the focus of Section 3, is the problem of predicting the status 
of an individual at age t +10 years from a number of variables measured at age t. For this we use 
the regularized multinomial model described in (1), (2). We encode K = 3 multinomial categories as 
normal, dementia, and death: these are the three possible outcomes for any individual at age t + 10. 
We are mainly interested in the prediction of dementia; this task is complicated by the fact that 
risk factors for dementia are also known to be risk factors for death [Rosvall et al., 2009], and so to 
account for this, we include the death category in the multinomial classification model. An alternate 
approach would be to use a Cox proportional hazards model [Cox, 1972], where the event of interest 
is the onset of dementia, and censorship corresponds to death. 

Traditionally, the Cox model is not fit with time-varying predictors or time-varying coefficients, 
but it can be naturally extended to the setting considered in this work, even using the same regular¬ 
ization schemes. Instead of the multinomial model (1), we would model the hazard function as 

(3) h{t + A|Xj. f - x) = h 0 (t + A)• exp(x T p. t ), 

where peU pxT are a set of coefficients over time, and ho is some baseline hazard function (that does 
not depend on predictor measurements). Note that the hazard model (3) relates the instantaneous 
rate of failure (onset of dementia) at time t + A to the predictor measurements at time t. This is as in 
the multinomial model (1), which relates the outcomes at time t + A (dementia or death) to predictor 
measurements at time t. The coefficients in (3) would be determined by maximizing the partial log 
likelihood with the analogous lasso and fused lasso penalties on ft, as in the above multinomial 
setting (2). 

The partial likelihood approach can be viewed as a sequence of conditional log odds models [Efron, 
1977, Kalbflesich and Prentice, 2002], and therefore one might expect the (penalized) Cox regression 
model described here to perform similarly to the (penalized) multinomial regression model pursued 
in this paper. In fact, the computational routine described in Section 2 would apply to the Cox model 
with only very minor modifications (that concern the gradient computations). A rigorous comparison 
of the two approaches is beyond the scope of the current manuscript, but is an interesting topic for 
future development. 
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1.3. Outline. The rest of this paper is organized as follows. In Section 2, we describe a proxi¬ 
mal gradient descent algorithm for efficiently computing a solution (Po,p) in (1). Next, we present 
an analysis of the CHS-CS data set in Section 3. Section 4 discusses the stability of estimated co¬ 
efficients, and related concepts. In Section 5 we discuss numerous approaches for the selecting the 
tuning parameters Ai, A 2 > 0 that govern the strength of the lasso and fused lasso penalties in (1). In 
Section 6, we conclude with some final comments and lay out ideas for future work. 

2. A proximal gradient descent approach. In this section, we describe an efficient proxi¬ 
mal gradient descent algorithm for computing solutions of the fused lasso regularized multinomial 
regression problem (2). While a number of other algorithmic approaches are possible, such as im¬ 
plementations of the alternating direction method of multipliers [Boyd et al., 2011], we settle on 
the proximal gradient method because of its simplicity, and because of the extremely efficient, direct 
proximal mapping associated with the fused lasso regularizer. We begin by reviewing proximal gra¬ 
dient descent in generality, then we describe its implementation for our problem, and a number of 
practical considerations like the choice of step size, and stopping criterion. 

2.1. Proximal gradient descent. Suppose that g : — R is convex and differentiable, h : IR d — R 

is convex, and we are interested in computing a solution 

x* e argmin g(x) + h(x). 

If h were assumed differentiable, then the criterion f(x) = g(x) + h(x) is convex and differentiable, 
and repeating the simple gradient descent steps 

(4) x + =x-r V/Xx) 

suffices to minimize f , for an appropriate choice of step size r. (In the above, we write x + to denote the 
gradient descent update from the current iterate x.) If h is not differentiable, then gradient descent 
obviously does not apply, but as long as h is “simple” (to be made precise shortly), we can apply 
a variant of gradient descent that shares many of its properties, called proximal gradient descent. 
Proximal gradient descent is often also called composite or generalized gradient descent, and in this 
routine we repeat the steps 

(5) x + = prox hT (x - r Vg(x)) 

until convergence, where prox/, T : is the proximal mapping associated with h (and t), 

1 2 

(6) prox/j T (x) = argmin — ||x-z|| 2 + Mz). 

zeR d 2 t 

(Strict convexity of the above criterion ensures that it has a unique minimizer, so that the proximal 
mapping is well-defined.) Provided that h is simple, by which we mean that its proximal map (6) is 
explicitly computable, the proximal gradient descent steps (5) are straightforward and resemble the 
classical gradient descent analogues (4); we simply take a gradient step in the direction governed 
by the smooth part g, and then apply the proximal map of h. A slightly more formal perspective 
argues that the updates (6) are the result of minimizing h plus a quadratic expansion of g, around 
the current iterate x. 

Proximal gradient descent has become a very popular tool for optimization problems in statistics 
and machine learning, where typically g represents a smooth loss function, and h a nonsmooth reg¬ 
ularizer. This trend is somewhat recent, even though the study of proximal mappings has a long 
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history of in the optimization community (e.g., see Parikh and Boyd [2013] for a nice review paper). 
In terms of convergence properties, proximal gradient descent enjoys essentially the same conver¬ 
gence rates as gradient descent under the analogous assumptions, and is amenable to acceleration 
techniques just like gradient descent (e.g., Nesterov [2007], Beck and Teboulle [2009]). Of course, for 
proximal gradient descent to be applicable in practice, one must be able to exactly (or even approx¬ 
imately) compute the proximal map of h in (6); fortunately, this is possible for many optimization 
problems, i.e., many common regularizers h, that are encountered in statistics. In our case, the prox¬ 
imal mapping reduces to solving a problem of the form 

^ m m-1 

(7) 0 = argmin-||x-0||| + AiE |0d + A 2 E \9i-di+i\. 

8 * i =1 i =1 

This is often called the fused lasso signal approximator (FLSA) problem, and extremely fast, linear¬ 
time algorithms exist to compute its solution. In particular, we rely on an elegant dynamic program¬ 
ming approach proposed by Johnson [2013]. 


2.2. Application to the multinomial fused lasso problem. The problem in (2) fits into the desired 
form for proximal gradient descent, with g the multinomial regression loss (i.e., negative multinomial 
regression log likelihood) and h the lasso plus fused lasso penalties. Formally, we can rewrite (2) as 


( 8 ) 


00, P) G argmin g(/3 0 , /)) + h(Po,P), 


where g is the convex, smooth function 


T n (K -1 

' K -1 

g(Po,P)=ZL\ E = k)(Potk+Xi. t p. t k) +log 

1+ E exp (Poth + Xi.tP.th) 

t=li=l [k=l 

and h is the convex, nonsmooth function 

< h= 1 


p T K -1 p T-lK-1 

iEEE ifrai+AaE E I 

j=u=ik=i j=it=ik=i 


\Pjtk fj(t+i)k\- 


Here we consider fixed values Ai,A 2 > 0. As described previously, each of these tuning parameters 
will have a big influence on the strength of their respective penalty terms, and hence the properties 
of the computed estimate 0o,pf, we discuss the selection of Ai and A 2 in Section 5. We note that the 
intercept coefficients Pq are not penalized. 

To compute the proximal gradient updates, as given in (5), we must consider two quantities: 
the gradient of g, and the proximal map of h. First, we discuss the gradient. As Po e U Tx( - K ~ 1 \ 
P e ^p xTx ( k ~ 1) , we may consider the gradient as having dimension Vg(Po,/3) e j$(p+ 1 ) xTx (-K’- 1 \ We 
will index this as [Vg(fio> fi)]jtk for j = 0,.. .p, t = 1,... T, k = 1,.. .K - 1; hence note that [Vg(Po, P)hotk 
gives the partial derivative of g with respect to Potk, and l^g(Po,P)]jtk the partial derivative with 
respect to Pjtk, for j = l,...p. For generic t,k, we have 


(9) 


[Vg(Po,P)\oth 


n 


I 


-I (Y it = k) + 


expiPotk+Xj.tP.tk) 

^ + 'Lh=l ex V(Poth + X i . t p. t h), 


and for j > 1, 


( 10 ) 


[Vg(Po,P)]jtk = E 


-I <y it = k)x ijt +X; 


L ijt 


i=l V 


exp (Potk+Xj.tP.tk) 

1 + Ef “^exp (p 0 th +X l . t p. th ), 
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It is evident that computation of the gradient requires O(npTK). 

Now, we discuss the proximal operator. Since the intercept coefficients /3o £ [R Tx(A_1) are left unpe¬ 
nalized, the proximal map over /3o just reduces to the identity, and the intercept terms undergo the 
updates 

Potk = Potk-TVJg(Po,P)]otk for t = l,...T,k = l,...K-l. 

Hence we consider the proximal map over ft alone. At an arbitrary input x £ U pxTx( ' K ^ 1 *, this is 

l p T K-l p T K -1 p T-1K-1 

argmin — £ £ £ (%jtk % jtk ) + Ai ii n n x x \Pjtk Pj(t+Dk\> 

zeR p*Tx(K-l) 4T j =lt=1 k=1 j =lt=lk=1 j =lt=lk=1 

which we can see decouples into p(K- 1) separate minimizations, one for each predictor j = l,...p 
and class k = In other words, the coefficients ft undergo the updates 

(11) ft + - k = argmin - £ {(ftj. k -T[Vg(fto,ft)]j.k)-0) + T/li £ \Q t \ + tA 2 £ l^f-^+ll> 

0eR r ^£=1 V ’ t— 1 £=1 

for j = l,...p, ^ 1, 

each minimization being a fused lasso signal approximator problem [Tibshirani et al., 2005], i.e., 
of the form (7). There are many computational approaches that may be applied to such a problem 
structure; we employ a specialized, highly efficient algorithm by Johnson [2013] that is based on 
dynamic programming. This algorithm requires O(T) operations for each of the problems in (11), 
making the total cost of the update O(pTK) operations. Note that this is actually dwarfed by the cost 
of computing the gradient Vg(fto,ft) in the first place, and therefore the total complexity of a single 
iteration of our proposed proximal gradient descent algorithm is O(npTK). 

2.3. Practical considerations. We discuss several practical issues that arise in applying the prox¬ 
imal gradient descent algorithm. 

2.3.1. Backtracking line search. Returning to the generic perpsective for proximal gradient de¬ 
scent as described in Section 2.1, we rewrite the proximal gradient descent update in (5) as 

(12) x + = x-tG t (x), 

where G T (x) is called the generalized gradient and is defined as 

x - prox/j T (x - rVg(x)) 


The update is rewritten in this way so that it more closely resembles the usual gradient update 
in (4). We can see that, analogous to the gradient descent case, the choice of parameter r > 0 in 
each iteration of proximal gradient descent determines the magnitude of the update in the direction 
of the generalized gradient G T (x). Classical analysis shows that if Vg is Lipschitz with constant 
L> 0, then proximal gradient descent converges with any fixed choice of step size r < 1/L across 
all iterations. In most practical situations, however, the Lipschitz constant L of Vg is not known 
or easily computable, and we rely on an adaptive scheme for choosing an appropriate step size at 
each iteration; backtracking line search is one such scheme, which is straightforward to implement 
in practice and guarantees convergence of the algorithm under the same Lipschitz assumption on 
Vg (but importantly, without having to know its Lipschitz constant L). Given a shrinkage factor 
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0 < 7 < 1 , the backtracking line search routine at a given iteration of proximal gradient descent 
starts with r = to (a large initial guess for the step size), and while 

(13) g[x-TG T (x)) >g(x)-TVg(x) T G T (x)+ ^\\G z (x)\\l, 

it shrinks the step size by letting t = jt. Once the exit criterion is achieved (i.e., the above is no 
longer satisfied), the proximal gradient descent algorithm then uses the current value of t to take an 
update step, as in (12) (or (5)). 

In the case of the multinomial fused lasso problem, the generalized gradient is of dimension 
Gjipo, p) e K ( 'P +1)xTx(if - 1) , where 


[G T (Po,p)] 0 .. = Wg(Po,p)h.., 


and 


r n ,a ( j J k ~ Vrox FLSA (Pj. k - r[Vg(Po,P)]j. k ) 

[G T (Po,P)]j.k = - for j = 1 k = 


Here W ox flsa,AP jk - T\Vg(po, P)\ r k) is the proximal map defined by the fused lasso signal approx¬ 
imator evaluated at P r k ~ r[Vg(/)o, p)\j-k, he., the right-hand side in ( 11 ). Backtracking line search 
now applies just as described above. 


2.3.2. Stopping criteria. The simplest implementation of proximal gradient descent would run 
the algorithm for a fixed, large number of steps S. A more refined approach would check a stopping 
criterion at the end of each step, and terminate if such a criterion is met. Given a tolerance level 
e > 0 , two common stopping criteria are then based on the relative difference in function values, as 
in 

, + \f(P + 0 ,P + )-f(Po,P)\ 

stopping criterion 1 : terminate it Ci =-—-< e, 

/ (Po>p) 


and the relative difference in iterates, as in 


stopping criterion 2 : terminate if C 2 = 


\\(p^p + )-(po,P)h 

\\(Po,P)h 


< e. 


The second stopping criterion is generally more stringent, and may be hard to meet in large problems, 
given a small tolerance e. 

For the sake of completeness, we outline the full proximal gradient descent procedure in the no¬ 
tation of the multinomial fused lasso problem, with backtracking line search and the first stopping 
criterion, in Algorithms 1 and 2 below. 


2.3.3. Missing individuals. Often in practice, some individuals are not present at some time- 
points in the longitudinal study, meaning that one or both of their outcome values and predictor 
measurements are missing over a subset of t = 1,... T. Let I t denote the set of completely observed in¬ 
dividuals (i.e., with both predictor measurements and outcomes observed) at time t, and let n t = \Ip- 
The simplest strategy to accomodate such missingness would be to compute the loss function g only 
observed individuals, so that 

T [K -1 / K -1 

g(Po, P) = Yj Y I Y = kKPotk "t X-i-tP-tk) + log 1 + Y ex P(A) th+Xi-tP-th) 

t=li£l t l k=l V h= 1 
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Algorithm 1 Proximal gradient descent for the multinomial fused lasso 

INPUT: Predictors X, outcomes Y, tuning parameter values X\, A 2 , initial coefficient guesses /3 (u, ) > maximum num- 
ber of iterations S, initial step size before backtracking to, backtracking shrinkage parameter y, tolerance e 
OUTPUT: Approximate solution (/!o,/3) 


1 : 

2 : 

3: 

4: 

5: 

6 : 

7: 

8 : 

9: 

10 

11 

12 

13 

14 


s = 1, C = 00 

while (s < S and C > e) do 

Find t s using backtracking, Algorithm 2 (INPUT: 

Update the intercept: — t s [V^(/3q S— 

for j = l,...p do 

for k = 1,...(K-1) do 

Update pf k =prox FLSA>Ts (M?- 1) -T s [V^(/l< ) s - 1) ,/l< s - 1 ))] y . A ) 

end for 
end for 

Increment s = s + 1 

Compute C = [fi/5^,^ s) ) - «) 

end while 

/0O = /6[) S) J=/ S) 

return (y8o, /8) 


Algorithm 2 Backtracking line search for the multinomial fused lasso 
INPUT: /3 0 ,/3,r 0 ,y 
OUTPUT: r 

1: t = t 0 

2: while (true) do 

3: Compute [G T (j6 0 ,/3)] 0 .. = [Vg(/3 0 ,/3)]o.. 

4: for j = l,...p do 

5: for k = 1) do 

6: Compute [G T (/3 0 ,/3)]^ = [fij. k -prox FLSAT (^.^ -t[V g(p 0 ,p)]j. k )]/T 

7: end for 

8: end for 

9: if ^((/3 0 ,/3) - tG t (/3 0 , /3)) > ^(yS 0 , /3) - t[V^(/3 0 , /3)] r G T (/3 0 ,/3) + g II G T (/3 0 ,/3) II| then 

10: Break 

11: else 

12: Shrink t = yr 

13: end if 

14: end while 
15: return r 


An issue arises when the effective sample size n t is quite variable across timepoints t: in this case, 
the penalty terms can have quite different effects on the coefficients at one time t versus another. 
That is, the coefficients fr.t at a time t in which nt is small experience a relatively small loss term 

\k-i ( K -1 

(14) X ) X ~ k)(f$o t k + Xi. t fi. t k) + \og 1 + ^ expf^ot/i + Xi. t p , t h) r> 

itl t l k=l V h =1 J 

simply because there are fewer terms in the above sum compared to a time with a larger effective 
sample size; however, the penalty term 

p K -1 p K -1 

4-1 YL ^ \Pjtk\ + ^2^ ^ \Pjtk ~ Pj(t+l)k I 

j=lk=l j=lk=l 

remains comparable across all timepoints, regardless of sample size. A fix would be to scale the loss 
term in (14) by n t to make it (roughly) independent of the effective sample size, so that the total loss 
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becomes 


(K-l 


K -1 


(15) g(/3o,P)~Y^ — X) X ~0(Yif - k)(potk +Xi. t [3. i^l + log 1+ ^ expiPoth+Xi.tfi.th) 


t= 1 ra * id 


k /l=l 


This modification indeed ends up being important for the Alzheimer’s analysis that we present in 
Section 3, since this study has a number of individuals in the tens at some timepoints, and in the 
hundreds for others. The proximal gradient descent algorithm described in this section extends to 
cover the loss in (15) with only trivial modifications. 


2.4. Implementation in C++ and R. An efficient C++ implementation of the proximal gradient 
descent algorithm described in this section, with an easy interface to R, is available from the sec¬ 
ond author’s website: http://www.stat.cmu.edu/~flecci. In the future, this will be available as 
part of the R package glmgen, which broadly fits generalized linear models under generalized lasso 
regularization. 


3. Alzheimer’s Disease data analysis. In this section, we apply the proposed estimation method 
to the data of the the Cardiovascular Health Study Cognition Study (CHS-CS), a rich database of 
thousands of multiple cognitive, metabolic, cardiovascular, cerebrovascular, and neuroimaging vari¬ 
ables obtained over the past 24 years for people of ages 65 to 110 years old [Fried et al., 1991, Lopez 
et al., 2007]. 

The complex relationships between age and other risk factors produce highly variable natural 
histories from normal cognition to the clinical expression of Alzheimer’s disease, either as dementia 
or its prodromal syndrome, mild cognitive impairment (MCI) [Lopez et al., 2003, Saxton et al., 2004, 
Lopez et al., 2007, Sweet et al., 2012, Lecci, 2014]. Many studies involving the CHS-CS data have 
shown the importance of a range of risk factors in predicting the time of onset of clinical dementia. 
The risk of dementia is affected by the presence of the APOE*4 allele, male sex, lower education, and 
having a family history of dementia [Fitzpatrick et al., 2004, Tang et al., 1996, Launer et al., 1999]. 
Medical risks include the presence of systemic hypertension, diabetes mellitus, and cardiovascular 
or cerebrovascular disease [Kuller et al., 2003, Irie et al., 2005, Skoog et al., 1996]. Lifestyle factors 
affecting risk include physical and cognitive activity, and diet [Verghese et al., 2003, Erickson et al., 
2010, Scarmeas et al., 2006]. 

A wide range of statistical approaches has been considered in these studies, including exploratory 
statistical summaries, hypothesis tests, survival analyses, logistic regression models, and latent tra¬ 
jectory models. None of these methods can directly accommodate a large number of predictors that 
can potentially exceed the number of observations. A small number of variables was often chosen a 
priori to match the requirements of a particular model, neglecting the full potential of the CHS-CS 
data, which consists of thousands of variables. 

The approach that we introduced in Section 1 can accommodate an array of predictors of arbitrary 
dimension, using regularization to maintain a well-defined predictive model and avoid overfitting. 
Our goal is to identify important risk factors for the prediction of the cognitive status at t + 10 years 
of age (A = 10), given predictor measurements at t years of age, for t = 65,66,... ,98. We use the pe¬ 
nalized log likelihood criterion in (2) to estimate the coefficients of the multinomial logit model in 
(1). The lasso penalty forces the solution to be sparse, allowing us to identify a few important predic¬ 
tors among the thousands of variables of the CHS-CS data. The fused lasso penalty allows for a few 
change points in the piecewise constant coefficient trajectories Pj-k, across t = 65,... 98. Justification 
for this second penalty is based on the scientific intuition that predictors that are clinically important 
should have similar effects in successive ages. 
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3.1. Data preprocessing. We use data from the n = 924 individuals in the Pittsburgh section of 
the CHS-CS, recorded between 1990 and 2012. Each individual underwent clinical and cognitive 
assessments at multiple ages, all falling in the range 65,... 108. The matrix of (future) outcomes Y 
has dimension n x 34: for i = 1,... 924 and t = 65,... 98, the outcome Y lt stores the cognitive status at 
age t + 10 and can assume one of the following values: 



if normal 
if MCI/dementia. 
if dead 


MCI is included in the same class as dementia, as they are both instances of cognitive impairment. 
Hence the proposed multinomial model predicts the onset of MCI/dementia, in the presence of a 
separate death category. This is done to implicitly adjust for the confounding effect of death, as some 
risk factors for dementia are also known to be risk factors for death [Rosvall et al., 2009]. 

The array of predictors X is composed of time-varying variables that were recorded at least twice 
during the CHS-CS study, and time-invariant variables, such as gender and race. A complication in 
the data set is the ample amount of missingness in the array of predictors. We impute missing values 
using a uniform rule for all possible causes of missingness. A missing value at age t is imputed by 
taking the closest past measurement from the same individual, if present. If all the past values 
are missing, the global median from people of age t is used. The only exception is the case of time- 
invariant predictors, whose missing values are imputed by either future or past values, as available. 

Categorical variables with m possible outcomes are converted to m - 1 binary variables and all 
the predictors are standardized to have zero mean and unit standard deviation. This is a standard 
procedure in regularization, as the lasso and fused lasso penalties puts constraints on the size of 
the coefficients associated with each variable [Tibshirani, 1997]. To be precise, imputation of missing 
values and standardization of the predictors are performed within each of the folds used in the cross- 
validation method for the choice of the tuning parameters X\ and A2 (discussed below), and then 
again for the full data set in the final estimation procedure that uses the selected tuning parameters. 

The final array of predictors X has dimension 924 x 1050 x 34, where 1050 is the number of vari¬ 
ables recorded over the period of 34 years of age range. 


3.2. Model and algorithm specification. In the Alzheimer’s Disease application, the multinomial 
model in (1) is determined by two equations, as there are three possible outcomes (normal, MCI/ 
dementia, death); the outcome “normal” is taken as the base class. We will refer to the two equa¬ 
tions (and the corresponding sets of coefficients) as the “dementia vs normal” and “death vs normal” 
equations, respectively. 

We use the proximal gradient descent algorithm described in Section 2 to estimate the coefficients 
that maximize the penalized log likelihood criterion in (2). The initializations (/)q°\ j0 (O) ) are set to be 
zero matrices, the maximum number of iterations is S = 80, the initial step size before backtracking 
is To = 20, the backtracking shrinkage parameter is y = 0.6 and the tolerance of the first stopping 
criterion (relative difference in function values) is e = 0.001. We select the tuning parameters by a 
4-fold cross-validation procedure that minimizes the misclassification error. The selected parameters 
are Ai = 0.019 and Ay = 0.072, which yield an average prediction error of 0.316 (standard error 0.009). 
Section 5 discusses more details on the model selection problem. 

The number zi t of outcomes observed at age t varies across time, for two reasons: first, different 
subjects entered the study at different ages, and second, once a subject dies at time to, we exclude 
them consideration in the model formed at all ages t > 1 0, to predict the outcomes of individuals at 
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age t+ 10. The maximum number of outcomes is 604 at age 88, whereas the minimum is 7 at age 108. 
We resort to the strategy described in Section 2.3.3 and use the scaled loss in (15) to compensate for 
the varying sample sizes. 

3.3. Results. Out of the 1050 coefficients associated with the predictors described above, 148 are 
estimated to be nonzero for at least one time point in the 34 years age range. More precisely, for 
at least one age, 57 coefficients are nonzero in the “dementia vs normal” equation of the predictive 
multinomial logit model, and 124 are nonzero in the “death vs normal” equation. 


Dementia vs normal 


Variable 

Meaning (and coding for categorical variables, before scaling) 

raced.2 

Race: "White" 1, else 0 

cdays59 

Taken vitamin C in the last 2 weeks? (number of days) 

newthg68.1 

How is the person at learning new things wrt 10 yrs ago? "A bit worse" 1, else 0 

estrop39 

If you not currently taking estrogen, have you taken in the past? "Yes" 1, "No" 0 

fear05.1 

How often felt fearful during last week? "Most of the time" 1, else 0 

early39 

Do you usually wake up far too early? "Yes" 1, "No" 0 

gendd 

Gender: "Female" 1, "Male" 0 

hctz06 

Medication: thiazide diuretics w/o K-sparing. "Yes" 1, "No" 0 

raced.1 

Race: "Other (no white, no black)" 1, else 0 

orthos27 

Do you use a lower extremity orthosis? "Yes" 1, "No" 0 

pulse21 

60 second heart rate 

grpsym09.1 

What causes difficulty in gripping? "Pain in arm/hand" 1, else 0 

sick03.2 

If sick, could easily find someone to help? "Probably False" 1, else 0 

digcor 

Digit-symbol substitution task: number of symbols correctly coded 

trust03.3 

There is at at least one person whose advice you really trust. "Probably true" 1, else 0 


Death vs normal 


Variable 

Meaning (and coding for categorical variables, before scaling) 

digcor 

Digit-symbol substitution task: number of symbols correctly coded 

ctime27 

Repeated chair stands: number of seconds 

gendd 

Gender: "Female" 1, "Male" 0 

cis42 

Cardiac injury score 

hurry 5 9.2 

Ever had pain in chest when walking uphill/hurry? "No" 1, else 0 

numcig59 

Number of cigarettes smoked per day 

dig06 

Digitalis medicines prescripted? "Yes" 1, "No" 0 

smoke.3 

Current smoke status: "Never smoked" 1, else 0 

hlthl59.1 

Would you say, in general, your health is.. ? "Fair" 1, else 0 

exer59 

If gained/lost weight, was exercise a major factor? "Yes" 1, "No" 0 

nomeds06 

Number of medications taken 

diabada.3 

ADA diabetic status? "New diabetes" 1, else 0 

anyone 

Does anyone living with you smoke cigarettes regularly? "Yes" 1, "No" 0 

ltaai 

Blood pressure variable: left ankle-arm index 

whmile09.2 

Do you have difficulty walking one-half a mile? "Yes" 1, else 0 


Table 2 


The 15 most important variables in the two separate equations of the multinomial logit model. 

Figure 2 shows the 15 most important variables in the 34 years age range, separately for the two 
equations. The measure of importance is described in detail in Section 4 and is, in fact, a measure of 
stability of the estimated coefficients, across 4 subsets of the data (the 4 training sets used in cross- 
validation). The plots on the left show the relative importance of the 15 variables with respect to the 
most important one, whose importance was scaled to be 100. The plots on the right show, separately 
for the two equations, the longitudinal estimated coefficients for the 15 most important variables, 
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Relative importance: dementia vs normal 
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Relative importance: death vs normal 


A 

p 2 - death vs normal 



W) 

"c 

o 

o3 

o 

O 


1 

100 


CM 

o 


o 

o 



CM 

o — 1 
I 



ctime27 

numcig59 

cis42 

diabada.3 

hlthl 59.1 

dig06 

whmile09.2 

nomeds06 

anyone 

exer59 

Itaai 

hurry59.2 

smoke.3 

gendOl 

digcor 


Fig 2. CHS-CS data analysis. Left: relative importance plots for the 15 most important variables in the “dementia vs 
normal” and “death vs normal” equations of the multinomial logit model. Right: corresponding estimated coefficients. The 
order of the legends follow the order of the maximum / minimum values of the estimated coefficient trajectories. Note that 
some coefficients are estimated to be very close to 0 and the corresponding trajectories are hidden by other coefficients. 


using the data and algorithm specification described above. The meaning of these predictors and the 
coding used for the categorical variables are reported in Table 2. The nonzero coefficients that are 
not displayed in Figure 2 are less important (according to our measure of stability) and, for the vast 
majority, their absolute values are less than 0.1. 

We now proceed to interpret the results, keeping in mind that, ultimately, we are estimating the 
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coefficients of a multinomial logit model and that the outcome variable is recorded 10 years in the 
future with respect to the predictors. For example, an increase in the value of a predictor with positive 
estimated coefficient in the top right plot of Figure 2 is associated with an increase of the (10 years 
future) odds of dementia with respect to a normal cognitive status. In what follows, to facilitate the 
exposition of results, our statements are less formal. 

Inspecting the “dementia vs normal” plot we see that, in general, being Caucasian (race01.2) 
is associated with a decrease in the odds of dementia, while, after the age of 85, fear (fear05.1), 
lack of available caretakers (sick03.2), and deterioration of learning skills (newthg68.1) increase 
the odds of dementia. Variables hctz06 (a particular diuretic) and early39 (early wake-ups) have 
positive coefficients for the age ranges 65,... 78 and 77,... 91, respectively, and hence, if active, they 
account for an increase of the risk of dementia. The “death vs normal" plot reveals the importance 
of several variables in the age range 65,...85: longer time to rise from sitting in a chair (ctime27), 
more cigarettes (numcig59), higher cardiac injury score (cis42) are associated with an increase of the 
odds of death. Other variables in the same age range, with analogous interpretations, but lower im¬ 
portance, are diabada.3 ("new diabetes"’ diagnosis), hlthl59.1 ("fair" health status), dig06 (use of 
Digitalis), whmile09.2 (difficulty in walking). By contrast, in the same age range, good performance 
on the digit-symbol substitution task (digcor) accounts for a decrease in the odds of death. Finally, 
regardless of the age, being a non-smoker (smoke. 3) or being a woman (gendOl) decrease the odds of 
death. 



FIG 3. CHS-CS data analysis. Estimated intercept coefficients in the two separate equations of the multinomial logit model. 


Figure 3 shows the intercept coefficients /3oi and /Jo.2, which, we recall, are not penalized in the 
log likelihood criterion in (2). The intercepts account for time-varying risk that is not explained by 
the predictors. In particular, the coefficients po -2 increases over time, suggesting that an increas¬ 
ing amount of risk of death can be attributed to a subject’s age alone, independent of the predictor 
measurements. 
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3.4. Discussion. The results of the proposed multinomial fused lasso methodology applied to the 
CHS data are broadly consistent with what is known about risk and protective factors for dementia in 
the elderly [Lopez et al., 2013]. Race, gender, vascular and heart disease, lack of available caregivers, 
and deterioration of learning and memory are all associated with an increased risk of dementia. 
The results, however, provide critical new insights into the natural progression of MCI/dementia. 
First, the relative importance of the risk factors changes over time. As shown in Figure 2, with 
the exception of race, risk factors for dementia become more relevant after the age of 85. This is 
critical, as there is increasing evidence [Kuller et al., 2011] for a change in the risk profile for the 
expression of clinical dementia among the oldest-old. Second, the independent prediction of death, 
and the associated risk/protection factors, highlight the close connection between risk of death and 
risk of dementia. That is, performance on a simple, timed test of psychomotor speed (digit symbol 
substitution task) is a very powerful predictor of death within 10 years, as is a measure of physical 
strength/frailty (time to arise from a chair). Other variables, including gender, diabetes, walking and 
exercise, are all predictors of death, but are known, from other analyses in the CHS and other studies, 
to be linked to the risk of dementia. The importance of these risk/protective factors for death is 
attenuated (with the exception of gender) after age 85, likely reflecting survivor bias. Taken together, 
these results add to the growing body of evidence of the critical importance of accounting for mortality 
in the analysis of risk for dementia, especially among the oldest old [Kuller et al., 2011]. 

For our analysis we chose a 10 year time window for risk prediction. Among individuals of age 
65-75, who are cognitively normal, this may be a scientifically and clinically reasonable time window 
to use. However, had we similar data from individuals as young as 45-50 years old, then we might 
wish to choose time windows of 20 years or longer. In the present case, it could be argued that a 
shorter time window might be more scientifically and clinically relevant among individuals over the 
age of 80 years, as survival times of 10 years become increasingly less likely in the oldest-old. 

4. Measures of stability. Examining the stability of variables in a fitted model, subject to small 
perturbations of the data set, is one way to assess variable importance. Applications of stability, 
in this spirit, have recently gained popularity in the literature, across a variety of settings such 
as clustering (e.g., Lange et al. [2004]), regression (e.g., Meinshausen and Buhlmann [2010]), and 
graphical models (e.g., Liu et al. [2010]). Here we propose a very simple stability-based measure 
of variable importance, based on the definition of variable importance for trees and additive tree 
expansions [Breiman et al., 1984, Hastie et al., 2008]. We fit the multinomial fused lasso estimate 
(2) on the data set Xi„, Y;., for i = a subsample of the total individuals 1 and repeat 

this process R times. Let /r r) denote the coefficients from the rth subsampled data set, for r = 1,.. .R. 
Then we define the importance of variable j for class k as 

os) 

for each j = 1 , .. .p and k = which is the average absolute magnitude of the coefficients for 

the jth variable and Mh class, across all timepoints, and subsampled data sets. Therefore, a larger 
value of Ijk indicates a higher variable importance, as measured by stability (not only across sub¬ 
sampled data sets r, but actually across timepoints t, as well). Relative importances can be computed 
by scaling the highest variable importance to be 100, and adjusting the other values accordingly; for 
simplicity we typically consider relative variable importances in favor of absolute ones, because the 
original scale has no real meaning. 

There is some subtlety in the role of the tuning parameters Ai,A 2 used to fit the coefficients p' r) 
on each subsampled data set r = 1 ,...R. Note that the importance measure (16) reflects the impor- 
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tance of a variable in the context of a fitting procedure that, given data samples, produces estimated 
coefficients. The simplest approach would be to consider the fitting procedure defined by the multi¬ 
nomial fused lasso problem (2) at a fixed pair of tuning parameter values Ai,A 2. But in practice, it 
is seldom true that appropriate tuning parameter values are known ahead of time, and one typically 
employs a method like cross-validation to select parameter values (see Section 5 for a discussion of 
cross-validation and other model selection methods). Hence in this case, to determine variable im¬ 
portances in the final coefficient estimates, we would take care to define our fitting procedure in (16) 
to be the one that, given data samples, performs cross-validation on these data samples to determine 
the best choice of Ai,A 2, and then uses this choice to fit coefficient estimates. In other words, for 
each subsampled data set r = 1,...R in (16), we would perform cross-validation to determine tun¬ 
ing parameter values and then compute p (r> as the multinomial fused lasso solution at these chosen 
parameter values. This is more computationally demanding, but it is a more accurate reflection of 
variable importance in the final model output by the multinomial fused lasso under cross-validation 
for model selection. 

The relative variable importances for the CHS-CS data example from Section 3 are displayed in 
Figure 2, alongside the plots of estimated coefficients. Here we drew 4 subsampled data sets, each one 
containing 75% of the total number of individuals. The tuning parameter values have been selected 
by cross-validation. The variable importances were defined to incorporate this selection step into the 
fitting procedure, as explained above. We observe that the variables with high positive or negative 
coefficients for most ages in the plotted trajectories typically also have among the highest relative 
importances. Another interesting observation concerns categorical predictors, which (recall) have 
been converted into binary predictors over multiple levels: often only some levels of a categorical 
predictor are active in the plotted trajectories. 

5. Model selection. The selection of tuning parameters Ai, A2 is clearly an important issue that 
we have not yet covered. In this section, we discuss various methods for automatic tuning parameter 
selection in the multinomial fused lasso model (2), and apply them to a subset of the CHS-CS study 
data of Section 3, with 140 predictors and 600 randomly selected individuals, as an illustration. In 
particular, we consider the following methods for model selection: cross-validation, cross-validation 
under the one-standard-error rule, AIC, BIC, and finally AIC and BIC using misclassification loss 
(in place of the usual negative log likelihood). Note that cross-validation in our longitudinal setting 
is performed by dividing the individuals 1,... n into folds, and, per its typical usage, selecting the 
tuning parameter pair Ai, A2 (over, say, a grid of possible values) that minimizes the cross-validation 
misclassification loss. The one-standard-error rule, on the other hand, picks the simplest estimate 
that achieves a cross-validation misclassification loss within one standard error of the minimum. 
Here “simplest” is interpreted to mean the estimate with the fewest number of nonzero component 
blocks. AIC and BIC scores are computed for a candidate Ai, A2 pair by 

AlCUi, A 2 ) = 2 • loss((j8 0 ,P)x 1 m) + 2 • d£[0 o ,Ph lM ), 

BIC(Ai, A 2 ) = 2 • loss((/3 0 , + logNtot • df((/)o, /5 )ai,a 2 )> 

and in each case, the tuning parameter pair is chosen (again, say, over a grid of possible values) to 
minimize the score. In the above, (po,P)x 1 ,x 2 denotes the multinomial fused lasso estimate (2) at the 
tuning parameter pair Ai,A 2, and IVt 0 t denotes the total number of observations in the longitudi¬ 
nal study, N t ot = riT (or N tot = in the missing data setting). Also, df((ySo,)S) a 1 ,a 2 ) denotes the 

degrees of freedom of the estimate (/3o, /3 )ai,a 2 > and we employ the approximation 

df((/3o,/3)Ai,A 2 ) = # of nonzero blocks in 0o,fih u x 2 , 
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borrowing from known results in the Gaussian likelihood case [Tibshirani and Taylor, 2011, 2012]. 
Finally, loss((^o, P)a 1 ,a 2 ) denotes a loss function considered for the estimate, which we take either 
to be the negative multinomial log likelihood P)a 1 ,a 2 ), as is typical in AIC and BIC [Hastie 

et al., 2008], or the misclassification loss, to put it on closer footing to cross-validation. Note that 
both loss functions are computed in-sample, i.e., over the training samples, and hence AIC and BIC 
are computationally much cheaper than cross-validation. 

We compare these model selection methods on the subset of the CHS-CS data set. The individuals 
are randomly split into 5 folds. We use 4/5 of the data set to perform model selection and subsequent 
model fitting with the 6 techniques described above: cross-validation, cross-validation with the one- 
standard-error rule, and AIC and BIC under negative log likelihood and misclassification losses. To 
be perfectly clear, the model selection techniques work entirely within this given 4/5 of the data 
set, so that, e.g., cross-validation further divides this data set into folds. In fact, we used 4-fold 
cross-validation to make this division simplest. The remaining 1/5 of the data set is then used for 
evaluation of the estimates coming from each of the 6 methods, and this entire process is repeated, 
leaving out each fold in turn as the evaluation set. We record several measures on each evaluation set: 
the misclassification rate, true positive rate in identifying the dementia class, true positive positive 
rate in identifying the dementia and death classes combined, and degrees of freedom (number of 
nonzero blocks in the estimate). Figure 4 displays the mean and standard errors of these 4 measures, 
for each of the 6 model selection methods. 

Cross-validation and cross-validation with the one-standard-error rule both seem to represent a 
favorable balance between the different evaluation measures. The cross-validation methods provide 
a misclassification rate significantly better than that of the null model, which predicts according to 
the majority class (death), they yield two of the three highest true positive rates in identifying the 
dementia class, and perform well in terms of identifying the dementa and death classes combined (as 
do all methods: note that all true positive rates here are about 0.75 or higher). We ended up settling 
cross-validation under the usual rule, rather than the one-standard-error rule, because the former 
achieves the highest true positive rate in identifying the dementia class, which was our primary 
concern in the CHS-CS data analysis. By design, cross-validation with the one-standard-error rule 
delivers a simpler estimate in terms of degrees of freedom (196 for the one-standard-error rule versus 
388 for the usual rule) though both cross-validation models are highly regularized in absolute terms 
(e.g., the fully saturated model would have thousands of nonzero blocks). 

6. Discussion and future work. In this work, we proposed a multinomial model for high¬ 
dimensional longitudinal classification tasks. Our proposal operates under the assumption that a 
sparse number of predictors contribute more or less persistent effects across time. The multinomial 
model is fit under lasso and fused lasso regularization, which address the assumptions of sparsity 
and persistence, respectively, and lead to piecewise constant estimated coefficient profiles. We de¬ 
scribed a highly efficient computational algorithm for this model based on proximal gradient descent, 
demonstrated the applicability of this model on an Alzheimer’s data set taken from the CHS-CS, and 
discussed practically important issues such stability measures for the estimates and tuning param¬ 
eter selection. 

A number of extensions of the basic model are well within reach. For example, placing a group 
lasso penalty on the coefficients associated with each level of a binary expansion for a categorical 
variable may be useful for encouraging sparsity in a group sense (i.e., over all levels of a categorical 
variable at once). As another example, more complex trends than piecewise constant ones may be 
fit by replacing the fused lasso penalty with a trend filtering penalty [Kim et al., 2009, Tibshirani, 
2014], which would lead to piecewise polynomial trends of any chosen order k. The appropriateness 
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Fig 4. Comparison of different methods for selection of tuning parameters Ai,A 2 on the CHS-CS data set. The x-axis in 
each plot parametrizes the 6 different methods considered, which are, from left to right: A1C and BIC under negative log 
likelihood loss, AIC and BIC under misclassification loss, cross-validation, and cross-validation with the one-standard-error 
rule. The upper left plot shows (out-of-sample) misclassification rate associated with the estimates selected by each method, 
averaged over 5 iterations. The segments denote ±1 standard errors around the mean. The red dotted line is the average 
misclassification rate associated with the naive estimator that predicts all individuals as dead (the majority class). The 
upper right and bottom left plots show different measures of evaluation (again, computed out-of-sample): the true positive 
rate in identifying the dementia class, respectively, the true positive rate in identifying the dementia and death classes 
combined. Finally, the bottom right plot shows degrees of freedom (number of nonzero blocks) of the estimates selected by 
each method. 


of such a penalty would depend on the scientific application; the use of a fused lasso penalty assumes 
that the effect of a given variable is mostly constant across time, with possible change points; the use 
of a quadratic trend filtering penalty (polynomial order k = 2) allows the effect to vary more smoothly 
across time. 

More difficult and open-ended extensions concern statistical inference for the fitted longitudinal 
classification models. For example, the construction of confidence intervals (or bands) for selected 
coefficients (or coefficient profiles) would be an extremely useful tool for the practitioner, and would 
offer more concrete and rigorous interpretations than the stability measures described in Section 4. 
Unfortunately, this is quite a difficult problem, even for simpler regularization schemes (such as a 
pure lasso penalty) and simpler observation models (such as linear regression). But recent inferential 
developments for related high-dimensional estimation tasks [Zhang and Zhang, 2011, Javanmard 
and Montanari, 2013, van de Geer et al., 2013, Lockhart et al., 2014, Lee et al., 2013, Taylor et al., 
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2014] shed a positive light on this future endeavor. 
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